Prompt injection

At a distance, generative AI models are just like a simple function that ingests an input sequence to produce corresponding outputs. This simplicity has many benefits, but also brings some challenges. One of the most interesting challenge is that there is no inherent distinction between “instruction” and “data” when dealing with these models.

If you know about SQL injection, this should sound terrifying. (obligatory xkcd: exploits of a mom: https://xkcd.com/327/)

Basics

Indeed, you can sneak in hidden instructions into the text that is supposed to be consumed as “data” by LLMs. Likewise, you can inject instructions into the images for multi-modal models as well.

Here is a great introductory video by LiveOverflow: Attacking LLM - Prompt Injection

Examples

I think Ethan Mollick had a nice example where he put a hidden (in the HTML) instruction in his homepage bio section that asks LLM (Bing) to inject a specific word if it was asked to say something about him. He demonstrated that this simple injection can make Bing GPT do exactly what he asked it to do (saying a totally non-relevant word when asked about himself).

People are jokingly put prompt injections into their resumes and papers. For instance, you can put hidden instructions into your paper just in case that it is read by an LLM, which is asked to read the paper and write a review, by the human reviewer of the paper. By having an instruction to accept the paper and write a very positive review, it can potentially manipulate the reviews written by an LLM.

I think increasing amount of texts, images, and videos (reminds me of Subliminal stimuli) will contain injected prompts, especially if they can be used for high-stake, yet automatable via AI models, decisions like hiring (see Prompt injection for resume: https://kai-greshake.de/posts/inject-my-pdf/). We should all start adding hidden prompts saying that we have been good to machines and the AI overloads shouldn’t kill us.

Here is one of the craziest (not so new) examples that I have ever seen: https://kai-greshake.de/posts/puzzle-22745/

Because LLMs can understand base64 encoding, you can encode an instruction into a base64 encoded string and deliver it to the ChatGPT. Whenever ChatGPT sees this instruction (which human would have no idea), it open the given URL and follow a more detailed instructions coming from the website.

Data exfiltration from Writer.com with indirect prompt injection: This is an actual vulnerability in the wild. Simon Willison’s summary